Basic Text Analysis in R: WAIAI on-demand workshop

Introduction

What is text analysis?

Text analysis applies quantitative methods to unstructured text, such as reviews, open-ended survey responses, or customer tickets, to make it usable for analysis in a range of settings. Text analysis results are typically observational, but they can provide useful exploratory insights and support hypothesis generation.

How can basic text analysis be used to provide insight?

  1. Define question to be addressed

  2. Obtain text data after evaluating data sources, quality, and any ethical considerations

  3. Conduct exploratory data analyses

  4. Apply text analysis techniques, such as (but not limited to):

  • Sentiment analysis for quantifying positive and negative linguistic sentiment

  • Term frequency and keyword extraction to examine important terms

  • Topic modeling or cluster analyses to summarize themes

  • Text classification to categorize text into labels of interest

  5. Interpret and visualize results

  6. Formulate insights

Load packages

See all software references at the end of the tutorial.

#load packages and comment their uses
library(tidyverse) #data cleaning, organization, and visualization
library(psych) #summarize descriptive statistics and distributions
library(stringr) #string manipulation (also loaded with tidyverse)
library(skimr) #summarize descriptive statistics and distributions
library(tidytext) #sentiment analysis, text cleaning, and word frequency
library(textclean) #text cleaning
library(wordcloud) #wordcloud visualization
library(tm) #word association
library(vader) #sentiment analysis
library(topicmodels) #topic modeling
library(MetBrewer) #plot color palettes

# color palette
MetBrew_Egypt <- MetBrewer::met.brewer("Egypt", n = 5)
MetBrew_Tam <- MetBrewer::met.brewer("Tam", n = 15)

Text data cleaning and exploration

Data source: random subset of 10,000 reviews (2% of dataset) from the Amazon Fine Foods reviews dataset (McAuley & Leskovec, 2013). Learn more and download the data here: https://snap.stanford.edu/data/web-FineFoods.html

#read data downloaded from Amazon Fine Foods reviews source
df_raw <- read.delim("../foods.txt", header = F) #load raw data saved to wd
df_raw$V1 <- iconv(df_raw$V1, from = "", to = "UTF-8") #convert all text to UTF-8

#cleaning raw dataframe to create one column per item within V1, one row per review using regex and dplyr
df <- df_raw %>%
  dplyr::mutate(V1 = str_split(V1, "\\n")) %>% 
  unnest(V1) %>%
  dplyr::mutate(V1 = str_trim(V1)) %>%
  dplyr::filter(V1 != "") %>%
  dplyr::mutate(V1 = str_split(V1, "(?=review/) | (?=product/)")) %>%
  unnest(V1) %>%
  dplyr::mutate(V1 = str_trim(V1)) %>%
  dplyr::filter(V1 != "")  %>%
  dplyr::mutate(names = str_extract(V1, "^review/\\w+|^product/\\w+"),
         values = str_remove(V1, "^review/\\w+:\\s*|^product/\\w+:\\s*")) %>%
  select(names, values) %>%
  dplyr::mutate(review_id = as.numeric(cumsum(names == "product/productId"))) %>%
  pivot_wider(names_from = names, values_from = values) %>%
  select(-c("review/time", "review/summary", "review/profileName")) %>%
  rename("productId" = "product/productId",
         "userId" = "review/userId",
         "score" = "review/score",
         "text" = "review/text") %>%
    dplyr::mutate(userId = match(userId, sample(unique(userId))),
                productId = match(productId, sample(unique(productId)))) %>%
  select(c("review_id", "productId", "userId", "score", "text")) %>%
  as.data.frame() 

#save new ids as categorical variables
df$review_id <- as.factor(df$review_id)
df$userId <- as.factor(df$userId)
df$productId <- as.factor(df$productId)

rm(df_raw) #remove giant raw dataset from environment

set.seed(22) #set seed to sample reproducibly
df <- df %>% 
  dplyr::slice_sample(n = 10000) %>% #take a random subset of 10,000 reviews
  unnest(c(2:5)) #unnest list cols

#replace "NULL" with NA
df[df=="NULL"] <- NA

#write.csv(df, "./foods_small.csv") ## optional: save small top 10000 reviews for future use
#df <- read.csv("./foods_small.csv") %>% select(-c(X)) #optional: read in saved data to save future pre-processing time

#examine text data
knitr::kable(skim(df)) #overview of data
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace factor.ordered factor.n_unique factor.top_counts
character score 3 0.9997000 3 3 0 5 0 NA NA NA
character text 19 0.9981002 33 6839 0 9814 0 NA NA NA
factor review_id 0 1.0000000 NA NA NA NA NA FALSE 10000 311: 2, 4: 1, 16: 1, 45: 1
factor productId 0 1.0000000 NA NA NA NA NA FALSE 3525 358: 82, 949: 80, 756: 76, 682: 72
factor userId 0 1.0000000 NA NA NA NA NA FALSE 9280 484: 7, 495: 7, 527: 7, 165: 6

Takeaway: The “df” dataframe should have 5 columns representing the random subset of 10,000 Amazon Fine Foods reviews that we will be using for this tutorial:

  • review_id (factor): review identifier (created during data cleaning based on row number)

  • productId (factor): product identifier for product being reviewed (replaced with sequential number ID during data cleaning)

  • userId (factor): reviewer identifier (replaced with sequential number ID during data cleaning)

  • score (character): number of stars given by the reviewer

  • text (character): complete review text

Question to be addressed: What are some common customer experiences and pain points with Amazon Fine Foods products?

Exploratory data analysis

Distribution of review scores

#descriptive stats
psych::describe(df$score) #describe scores
##     vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1*    1 9998 4.14 1.32      5    4.41   0   1   5     4 -1.36     0.46 0.01
#plot scores
df %>% 
  drop_na(score) %>% 
  ggplot(aes(x = as.factor(score))) +
  geom_bar(color = "darkgray", fill = "gray") +
  labs(title = "Frequency of review scores", 
       x = "Score", y = "Frequency") +
  theme_classic()

Takeaway: Most reviews have 5 stars.

Average number of reviews per product

df %>% 
  count(productId) %>% 
  psych::describe() #describe number of reviews per product
##            vars    n    mean      sd median trimmed     mad min  max range skew
## productId*    1 3525 4867.68 2815.71   4853 4861.48 3660.54   1 9768  9767 0.01
## n             2 3525    2.84    5.76      1    1.62    0.00   1   82    81 6.93
##            kurtosis    se
## productId*    -1.20 47.43
## n             62.69  0.10
#plot number of reviews per product
df %>% 
  count(productId)  %>%
  ggplot(aes(x = n)) +
  geom_histogram(color = "darkgray", fill = "gray", binwidth = 5) +
  labs(title = "Frequency of reviews per product", 
       x = "Review count per product", y = "Frequency") +
  theme_classic()

Takeaway: Most products have only one or a few reviews (median 1), while a small number of products are reviewed far more often.

Comparing review scores of highly-reviewed and non-highly-reviewed products

#examine whether review scores vary among highly reviewed and non-highly-reviewed products
df <- df %>% 
  add_count(productId, name = "n_reviews") %>% #add column with count for number of reviews 
  dplyr::group_by(productId) %>%
  dplyr::mutate(highly_reviewed = ifelse(n_reviews > 2.84, "highly-reviewed", "not highly-reviewed")) %>% #create new column denoting if product is above/below mean of n reviews
  dplyr::ungroup() 

psych::describeBy(df$score, df$highly_reviewed) #examine whether scores differ on number of reviews
## 
##  Descriptive statistics by group 
## group: highly-reviewed
##     vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1*    1 6684 5.13 1.29      6    5.39   0   1   6     5 -1.33     0.47 0.02
## ------------------------------------------------------------ 
## group: not highly-reviewed
##     vars    n mean   sd median trimmed mad min max range skew kurtosis   se
## X1*    1 3314 4.15 1.38      5    4.44   0   1   5     4 -1.4     0.44 0.02
#plot scores by high/low review
df %>% 
  drop_na(score) %>%
  ggplot(aes(x = as.factor(score))) +
  geom_bar(color = "darkgray", fill = "gray") +
  labs(title = "Frequency of review scores for highly-reviewed and not-highly-reviewed products",
       x = "Score", y = "Frequency") +
  theme_classic() +
  facet_wrap(~highly_reviewed)

Takeaway: The distributions of review scores are broadly similar for products with above-average and below-average review counts.

Average number of reviews per reviewer

df %>% 
  count(userId) %>% 
  psych::describe() #describe number of reviews per reviewer
##         vars    n     mean       sd median  trimmed      mad min   max range
## userId*    1 9280 28765.37 16478.39  28659 28773.47 21057.37   9 57220 57211
## n          2 9280     1.08     0.35      1     1.00     0.00   1     7     6
##         skew kurtosis     se
## userId* 0.01    -1.20 171.06
## n       6.50    60.94   0.00
#plot number of reviews per reviewer
df %>% 
  count(userId)  %>%
  ggplot(aes(x = n)) +
  geom_histogram(color = "darkgray", fill = "gray", binwidth = 1) +
  labs(title = "Frequency of reviews per reviewer", 
       x = "Review count per reviewer", y = "Frequency") +
  theme_classic()

Takeaway: Most reviewers reviewed only once in this subset.

Examine wordcount

#add wordcount column
df <- df %>%
  as.data.frame() %>%
  dplyr::group_by(review_id) %>%
  dplyr::mutate(review_wordcount = str_count(text, pattern = "\\w+")) %>% #add wordcount column using regex
  dplyr::ungroup() 
df$review_wordcount <- as.numeric(df$review_wordcount)

#examine results
psych::describe(df$review_wordcount) #describe wordcount
##    vars    n  mean   sd median trimmed mad min  max range skew kurtosis   se
## X1    1 9982 84.64 82.7     60   68.95  43   6 1318  1312 3.87     27.9 0.83
hist(df$review_wordcount, breaks = 100) #wordcount histogram

Takeaway: Most reviews are relatively short (median 60 words), though the distribution has a long right tail.

#plot review score and review length
df %>%
  subset(review_wordcount <= (84.64+(3*82.70))) %>% #subset to remove review length outliers more than 3 SD from average review length
  ggplot(aes(y = as.numeric(review_wordcount), x = as.factor(score), color = as.factor(score), fill = as.factor(score))) +
  geom_jitter(size = 1, color = "gray") +
  geom_violin(alpha = 0.7) +
  scale_fill_manual(values = MetBrew_Egypt) +
  scale_color_manual(values = MetBrew_Egypt) +
  geom_boxplot(width = 0.1, color = "white") +
  labs(x = "Review Score", y = "Review wordcount") +
  theme_classic() +
  theme(legend.position = "none") 

Takeaway: Review scores do not appear to vary much with length.

Text cleaning

  • Text cleaning can reduce noise and improve analysis accuracy and efficiency, but it should be applied only when recommended for the analyses being used.

  • Here, we will clean the text by converting it to lowercase, removing extra whitespace, and removing numeric characters.

  • We will also create a version of the text data with further cleaning that is needed for some analyses, with:

    • punctuation removed and contractions replaced,
    • lemmatization (i.e., each word reduced to its morphological base, so “running” and “ran” become “run”),
    • stopwords removed (i.e., common words that typically carry little distinct meaning in some text analyses, e.g., “the,” “a”), and
    • tokenization (i.e., one token per row; here, one word per row).
  • Refer to best practices and documentation for specific analyses on which text cleaning steps are recommended (if any).

    • Consider whether an analysis is compositional (i.e., takes context into account vs. considering each word individually) before implementing text cleaning steps.

    • Sensitivity analyses can be a great tool to examine whether results change with and without text cleaning.

  • Be transparent about text cleaning steps taken when sharing results.
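The further-cleaning steps listed above can be illustrated on a single toy sentence (a sketch, not part of the review pipeline; it uses the same packages as the pipeline below):

```r
#toy demonstration of the further cleaning steps on one sentence
library(dplyr); library(tidytext); library(textclean); library(textstem)

x <- "I'm running to the stores; they'd 2 new flavors!"
x <- tolower(x)                            #lowercase
x <- textclean::replace_contraction(x)     #e.g., "i'm" -> "i am"
x <- gsub("[[:punct:]]|[0-9]+", "", x)     #strip punctuation and digits
x <- textstem::lemmatize_strings(x)        #e.g., "running" -> "run"

tibble(text = x) %>%
  tidytext::unnest_tokens(word, text) %>%  #tokenize: one word per row
  dplyr::filter(!word %in% subset(stop_words, lexicon == "snowball")$word) #drop stopwords
```

The same operations are applied to the full review dataframe in the next chunks.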

#clean text: convert to lowercase, trim whitespace, and remove numbers and html tags
df$text <- tolower(df$text) #place all text in lowercase
df$text <- trimws(df$text) #remove extra whitespace
df$text <- textclean::replace_number(df$text, remove = T) #remove numeric characters
df$text <- gsub("[0-9]+", "", df$text) #remove any remaining digits
df$text <- gsub("<[^>]*>", "", df$text) #remove html tags contained in < >

#see results
knitr::kable(head(df[, c("review_id", "text", "review_wordcount")], n = 5))
review_id text review_wordcount
24421 i have never been much of a soup person, unless down with a cold or too tired to cook up something fancy. once heated, i crumbled tortilla chips on top (or even some shredded cheese). since i love mexican food, this satisfied me for the evening. i was advised by progresso that they have discontinued this soup. thank goodness amazon had some for purchasing. 64
39808 the bars make for a nice snack and they taste natural but are not something that i would eat for pleasure. 21
56240 we really enjoyed this product – containes no sugar (contains dates and such) and it’s healthy while giving you the chocolate fix you need! 24
70928 the toy is excellent. my golden retriever is teething and this has been perfect for her.-stars because product description is not accurate. picture showed t-rex, but i received a brontosaurus. no big deal but if there are multiple models the description should indicate which one you’re getting 53
77000 type of pasta: orzo, it resembles more like rice than pasta, so this dish felt like a chicken,cheesy rice,broccoli type of dish as opposed to straight pasta.resembles: hamburger helper’s but much cheesierdifficulty: very easytime: minutesa box fills up: about people (maybe more if you eat less)amount of broccoli: very little, wish they had more! but yes, very little, though you can taste it. basically very small little flecks of freeze dried broccoli.amount of cheese: good amount of cheese for people!taste: b+ease to make: anutrition: c+/b-overall: b 120
df_words <- df #save separate lemmatized df
df_words$text <- textclean::replace_contraction(df_words$text) #replace contractions
df_words$text <- gsub("[[:punct:]]", "", df_words$text) #remove remaining punctuation
df_words$text <- textstem::lemmatize_strings(df_words$text) #lemmatize words

#tokenize
df_words <- df_words %>% 
  tidytext::unnest_tokens(word, text, token = "words") %>% 
  dplyr::filter(!is.na(word))

stopwords <- subset(stop_words, lexicon == "snowball") #select stopword lexicon

#remove stopwords from data
df_words <- df_words %>% 
  dplyr::filter(!word %in% stopwords$word) %>% 
  dplyr::filter(!is.na(word)) 

#see results
knitr::kable(head(df_words[, c("review_id", "word")], n = 5)) #top 5 rows
review_id word
24421 never
24421 much
24421 soup
24421 person
24421 unless

Basic text analysis

Examining frequent words and keywords

Most frequent words and bigrams

Most frequent words
#frequent word count dataframe
df_count <- df_words %>%
  count(word, sort = TRUE)

#show top frequent words table
knitr::kable(head(df_count, n = 25))
word n
good 7249
like 5261
much 4943
can 4746
taste 4636
flavor 3735
get 3336
one 3335
will 3214
love 3181
product 3121
make 3113
just 3086
try 2999
use 2901
great 2838
buy 2754
coffee 2742
food 2660
tea 2454
find 2339
dog 2317
eat 2297
little 2007
go 1883
## frequent words wordcloud
wordcloud(df$text, min.freq = 500, colors = brewer.pal(8, "Dark2")) #Dark2 has a max of 8 colors

#compare most frequent words among high and low scored reviews
df_stars <- df %>%
  dplyr::select(c(review_id, score)) %>%
  dplyr::mutate(high_low_score = ifelse(score == "1.0" | score == "2.0", "low",
                                 ifelse(score == "5.0", "high", NA))) #create high/low review score column

#plot top 25 words in high and low score reviews
df_words %>%
  left_join(., df_stars, by = c("review_id")) %>%
  subset(!is.na(high_low_score)) %>%
  dplyr::group_by(high_low_score) %>%
  count(word, sort = TRUE) %>% 
  top_n(n = 25) %>% 
  dplyr::mutate(word = reorder(word, n)) %>% 
  dplyr::ungroup() %>%
    ggplot(aes(n, reorder_within(word, n, high_low_score))) + 
    geom_col(color = "gray", fill = "darkgray") +
    labs(y = "Word", x = "Frequency ", title = "Amazon fine foods reviews subset top 25 most frequent words",
         subtitle = "Among high-scoring (5 star) and low-scoring (1 or 2 star) reviews") +
    geom_text(aes(label = n), hjust = 1, colour = "white") +
    theme_classic() +
    facet_wrap(~high_low_score, scales = "free")
## Selecting by n

Takeaway: The top few words across all reviews are “good,” “like,” “much,” “can,” and “taste”. 5-star reviews and 1- and 2-star reviews feature fairly similar words, although each word is taken out of context here.

Sensitivity analysis: Compare to most frequent words without text cleaning
#frequent word count with raw text
df_count2 <- df %>%
  tidytext::unnest_tokens(word, text, token = "words") %>% #redo tokenization on raw text without cleaning
  dplyr::filter(!is.na(word)) %>%
  count(word, sort = TRUE) #count occurrences of raw words

#top frequent words table
knitr::kable(head(df_count2, n = 25)) #show top 25 words
word n
the 33258
i 25459
and 22868
a 21338
to 18097
it 15898
of 14419
is 13309
this 11731
for 9676
in 9596
my 7815
that 7647
but 6761
with 6308
not 6094
have 6089
you 5882
are 5803
was 5677
they 5309
as 4996
like 4757
on 4725
so 4536
Most frequent bigrams
#frequent bigram count
df_bigrams <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% #repeat tokenization where token = bigram
  dplyr::filter(!is.na(bigram)) %>%
  tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
  dplyr::filter(!word1 %in% stopwords$word,
         !word2 %in% stopwords$word)  #remove stopwords

df_bigrams$word1 <- textstem::lemmatize_strings(df_bigrams$word1) #lemmatize words 
df_bigrams$word2 <- textstem::lemmatize_strings(df_bigrams$word2) #lemmatize words

#paste both words in bigram together in new column
df_bigrams <- df_bigrams %>%
  dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
  select(-c(word1, word2)) %>%
  count(bigram, sort = TRUE) 

#top frequent bigrams table
knitr::kable(head(df_bigrams, n = 25))
bigram n
taste like 459
k cup 425
highly recommend 322
dog food 321
green tea 319
gluten free 317
grocery store 294
peanut butter 238
much good 233
dog love 232
year old 230
taste good 217
taste great 211
really like 203
dark chocolate 202
cat food 176
really good 175
can get 171
great taste 171
great product 158
just like 154
potato chip 154
good price 152
look like 141
good taste 139
#compare high and low scored reviews
bigram_df_plot <- df %>%
  left_join(., df_stars, by = c("review_id")) %>%
  subset(!is.na(high_low_score)) %>%
  dplyr::group_by(high_low_score) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  dplyr::filter(!is.na(bigram)) %>%
  tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
  dplyr::filter(!word1 %in% stopwords$word,
         !word2 %in% stopwords$word)  #remove stopwords

bigram_df_plot$word1 <- textstem::lemmatize_strings(bigram_df_plot$word1) #lemmatize words
bigram_df_plot$word2 <- textstem::lemmatize_strings(bigram_df_plot$word2) #lemmatize words

#plot top bigrams in high and low scoring reviews
bigram_df_plot %>%
  dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
  select(-c(word1, word2)) %>%
  count(bigram, sort = TRUE) %>% 
  top_n(n = 25) %>% 
  dplyr::mutate(bigram = reorder(bigram, n)) %>% 
  dplyr::ungroup() %>%
    ggplot(aes(n, reorder_within(bigram, n, high_low_score))) + 
    geom_col(color = "gray", fill = "darkgray") +
    labs(y = "Bigram", x = "Frequency ", title = "Amazon fine foods reviews top 25 most frequent bigrams",
         subtitle = "Among high-scoring (5 star) and low-scoring (1 or 2 star) reviews") +
    geom_text(aes(label = n), hjust = 1, colour = "white") +
    theme_classic() +
    facet_wrap(~high_low_score, scales = "free")
## Selecting by n

Takeaway: The top bigrams across all reviews are “taste like,” “k cup,” “highly recommend,” “dog food,” and “green tea”. 5-star and 1- and 2-star reviews share several of these (e.g., “k cup,” “gluten free,” “dog food,” “green tea”), while 5-star reviews feature more positive bigrams (“highly recommend,” “great product,” “really good,” “taste great”) and 1- and 2-star reviews feature more negative bigrams (“taste like,” “look like,” “never buy”).
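Frequent bigrams can also be shown as a word network, a common tidytext-style extension (an optional sketch, not part of the original workflow; it assumes the igraph and ggraph packages are installed and uses the df_bigrams object built above):

```r
#optional sketch: visualize frequent bigrams as a word network
library(dplyr); library(tidyr); library(igraph); library(ggraph)

set.seed(22) #layout is stochastic, so set seed for reproducibility
df_bigrams %>%
  tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>% #recover word pairs
  dplyr::filter(n > 100) %>%                                  #keep only frequent bigrams
  igraph::graph_from_data_frame() %>%                         #edges = word1 -> word2
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +                       #darker edge = more frequent
  geom_node_point(color = "darkgray", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
```

Clusters in the resulting graph (e.g., words co-occurring around “tea” or “dog”) give a quick visual complement to the frequency tables.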

Sensitivity analysis: Compare to most frequent bigrams without text cleaning
#frequent bigrams with raw text
df_bigrams2 <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% #tokenize bigrams without further cleaning
  dplyr::filter(!is.na(bigram)) %>%
  tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
  dplyr::mutate(bigram = paste(word1, word2, sep = " ")) %>%
  select(-c(word1, word2)) %>%
  count(bigram, sort = TRUE)

#top frequent bigrams table
knitr::kable(head(df_bigrams2, n = 25))
bigram n
of the 2788
in the 2330
i have 2143
it is 2055
this is 2029
is a 1662
i was 1482
they are 1459
and i 1370
if you 1363
on the 1329
and the 1174
this product 1170
for a 1140
to be 1140
it was 1130
i am 1090
to the 1068
for the 1048
in a 1033
is the 974
the best 970
but i 926
a little 908
i would 895
Examine most frequent words via Document-Term Matrix and correlated terms
#create document-term matrix (tf-idf is computed here but only raw counts are kept)
df_dtm <- df_words %>%
  count(review_id, word, sort = TRUE) %>% #count words per review
  tidytext::bind_tf_idf(term = 'word',
              document = 'review_id',
              n = 'n') %>% #add term frequency info
  dplyr::filter(!is.na(tf)) %>%
  dplyr::select(review_id, word, n) %>%
  pivot_wider(id_cols = "review_id", names_from = "word", values_from = "n") %>% #pivot wider
  column_to_rownames(var='review_id') %>%
  as.data.frame()

#replace NA with 0 for DTM
df_dtm[is.na(df_dtm)] <- 0

#show top few rows to show example of DTM
knitr::kable(head(df_dtm[,c(1:20)], n = 5)) #select top few words in DTM 
thai tea mg dog coffee food can pad almond noodle rise cat gopher chip oil green capsule fructose treat juice
53927 53 0 0 0 0 3 3 24 0 10 0 0 0 0 1 3 0 0 0 0
36094 0 37 2 0 1 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0
59678 0 0 32 0 0 2 2 0 0 0 0 0 0 0 0 0 0 1 0 2
71779 0 0 0 32 0 28 4 0 0 0 0 1 0 0 0 0 0 0 1 0
36306 0 1 0 0 29 1 3 0 0 0 0 0 0 0 0 1 0 0 0 0
#cast DTM using tidytext cast_dtm
data_dtm <- df_words %>% 
  dplyr::count(review_id, word) %>% #count words per review
  dplyr::mutate(review_id = as.numeric(review_id)) %>%
  tidytext::cast_dtm(document = review_id, term = word, value = n) #cast document term matrix

#find frequent terms with over 1000 uses using tm package
findFreqTerms(data_dtm, 1000)
##  [1] "find"      "flavor"    "get"       "good"      "look"      "make"     
##  [7] "order"     "love"      "package"   "really"    "eat"       "just"     
## [13] "like"      "much"      "store"     "taste"     "add"       "little"   
## [19] "price"     "also"      "can"       "chocolate" "first"     "go"       
## [25] "know"      "one"       "sugar"     "think"     "time"      "treat"    
## [31] "use"       "will"      "year"      "great"     "dog"       "food"     
## [37] "give"      "try"       "buy"       "mix"       "recommend" "day"      
## [43] "tea"       "come"      "now"       "product"   "bite"      "coffee"   
## [49] "want"      "box"       "say"       "amazon"    "drink"     "even"     
## [55] "cup"       "bag"
#which words are correlated with "find"?
findAssocs(data_dtm, c("find"), (0.15))
## $find
##    can  store   much   hard amazon 
##   0.20   0.18   0.16   0.16   0.16

Takeaway: Across all reviews, the most common terms by word frequency are “find,” “flavor,” “get,” “good,” and “look”. “Find” is most correlated with “can,” “store,” “much,” “hard,” and “amazon”.
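The DTM chunk above computes tf-idf via bind_tf_idf but keeps only the raw counts. As an optional extension (a sketch, not part of the original workflow), tf-idf can instead be used to surface words distinctive to high- versus low-scoring reviews, reusing the df_words and df_stars objects built earlier:

```r
#optional sketch: rank words by tf-idf, treating the high/low score groups as documents
library(dplyr); library(tidytext)

df_words %>%
  left_join(df_stars, by = "review_id") %>%
  dplyr::filter(!is.na(high_low_score)) %>%
  count(high_low_score, word, sort = TRUE) %>%          #word counts per group
  tidytext::bind_tf_idf(word, high_low_score, n) %>%    #weight by tf-idf
  dplyr::group_by(high_low_score) %>%
  slice_max(tf_idf, n = 10)                             #top 10 most distinctive words per group
```

Note that with only two “documents,” any word appearing in both groups gets idf = 0, so this surfaces group-exclusive words; using reviews or products as documents gives a finer-grained view.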

#frequent and associated terms for 5-star reviews
data_dtm_high <- df_words %>% 
  subset(score == "5.0") %>%
  dplyr::count(review_id, word) %>%
  dplyr::mutate(review_id = as.numeric(review_id)) %>%
  tidytext::cast_dtm(document = review_id, term = word, value = n)

findFreqTerms(data_dtm_high, 500)
##  [1] "love"      "package"   "really"    "brand"     "eat"       "find"     
##  [7] "good"      "just"      "keep"      "like"      "much"      "store"    
## [13] "taste"     "buy"       "food"      "mix"       "recommend" "will"     
## [19] "day"       "give"      "tea"       "try"       "add"       "cat"      
## [25] "enjoy"     "get"       "water"     "year"      "also"      "can"      
## [31] "one"       "flavor"    "little"    "favorite"  "come"      "now"      
## [37] "order"     "product"   "work"      "great"     "make"      "go"       
## [43] "look"      "take"      "time"      "sweet"     "use"       "coffee"   
## [49] "want"      "first"     "think"     "amazon"    "drink"     "even"     
## [55] "pack"      "price"     "purchase"  "healthy"   "sugar"     "box"      
## [61] "know"      "say"       "cup"       "delicious" "chocolate" "chip"     
## [67] "bag"       "snack"     "need"      "dog"       "treat"
#which words are correlated with "find"?
findAssocs(data_dtm_high, c("find"), (0.15))
## $find
##    hard   store     can  amazon  appall finally grocery    much    year   local 
##    0.22    0.21    0.21    0.21    0.18    0.17    0.16    0.16    0.16    0.16 
## process  search    food 
##    0.16    0.16    0.15

Takeaway: The most common terms in 5-star reviews by word frequency are “love,” “package,” “really,” “brand,” and “eat”. “Find” is most correlated with “hard,” “can,” “store,” “amazon,” and “appall”.

#frequent and associated terms for 1 or 2-star reviews
data_dtm_low <- df_words %>% 
  subset(score == "1.0" | score == "2.0") %>%
  dplyr::count(review_id, word) %>%
  dplyr::mutate(review_id = as.numeric(review_id)) %>%
  tidytext::cast_dtm(document = review_id, term = word, value = n)

findFreqTerms(data_dtm_low, 300)
##  [1] "flavor"  "get"     "good"    "make"    "order"   "can"     "eat"    
##  [8] "go"      "like"    "much"    "one"     "taste"   "think"   "use"    
## [15] "will"    "just"    "buy"     "dog"     "food"    "product" "try"    
## [22] "even"    "coffee"
#which words are correlated with "find"?
findAssocs(data_dtm_low, c("find"), (0.15))
## $find
##            casei        dependant       physiology            react 
##             0.24             0.24             0.24             0.24 
##         rockstar themselvesupdate             tryi            annes 
##             0.24             0.24             0.24             0.24 
##            apear        guideline         honolulu          hygiene 
##             0.24             0.24             0.24             0.24 
##           inwhat        marketsit         regimine       substiture 
##             0.24             0.24             0.24             0.24 
##             usai             yoga          midwest           person 
##             0.24             0.24             0.23             0.21 
##           entire       reasonably           pickup            thier 
##             0.21             0.21             0.21             0.20 
##          success           seldom              try   challengecliff 
##             0.20             0.20             0.19             0.19 
##          deviate           dietas              ibs             road 
##             0.19             0.19             0.19             0.19 
##            rural       walmartcom       weightthis         strength 
##             0.19             0.19             0.19             0.19 
##      unavailable               cv     increasingly        tastewell 
##             0.19             0.19             0.19             0.19 
##           effect             much          request           review 
##             0.18             0.18             0.18             0.17 
##      restriction       substitute             bite          product 
##             0.17             0.17             0.16             0.16 
##             safe             line              lie             easy 
##             0.16             0.16             0.16             0.16 
##            towel             wary             youi          patient 
##             0.16             0.16             0.16             0.16 
##      outstanding            yield             size           severe 
##             0.16             0.16             0.15             0.15

Takeaway: The most common terms in 1- and 2-star reviews by word frequency are “flavor,” “get,” “good,” “make,” and “order”. “Find” is most correlated with “casei,” “dependant,” “physiology,” “react,” and “rockstar” (likely noise from misspelled and run-together words in this smaller set of reviews).

Topic model (LDA)

We can summarize the topics discussed across all reviews using Latent Dirichlet Allocation (LDA) topic modeling (Blei, Ng, & Jordan, 2003). There are many approaches to topic modeling, but LDA is a common one and is effective for providing an overview of review content here.
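The tutorial fixes k = 15 topics. As an optional check (a sketch, not part of the original workflow; the 80/20 split and candidate k values are assumptions), held-out perplexity can be compared across candidate values of k, with lower perplexity indicating better generalization to unseen documents:

```r
#optional sketch: compare held-out perplexity for several candidate topic counts
library(topicmodels)

set.seed(2025)
train_idx <- sample(nrow(data_dtm), floor(0.8 * nrow(data_dtm))) #80/20 document split

perplexities <- sapply(c(5, 10, 15, 20), function(k) {
  fit <- topicmodels::LDA(data_dtm[train_idx, ], k = k, control = list(seed = 2025))
  topicmodels::perplexity(fit, newdata = data_dtm[-train_idx, ]) #lower = better held-out fit
})
perplexities
```

Perplexity tends to keep decreasing with k, so it is best read alongside topic interpretability rather than as the sole criterion.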

#estimate an LDA topic model with 15 topics
set.seed(2025)
reviews_lda <- topicmodels::LDA(data_dtm, k = 15, control = list(seed = 2025)) #run topic model with 15 topics

#get top 15 terms per topic
knitr::kable(topicmodels::get_terms(reviews_lda, 15))
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15
product like good flavor food organic tea dog order sugar chocolate price almond coffee use
use taste make chip cat ingredient green treat box drink bar amazon salt cup hot
cereal can use good eat product good love product taste good buy nut good sauce
good good cook like dog fat flavor get package flavor taste store seed flavor popcorn
hair much mix taste can list taste chew arrive juice like good blue like good
like try like eat much food like one good fruit flavor find good taste make
bottle just noodle snack good much drink will receive like make can snack much oz
shampoo get just love like contain much give cookie water much product healthy one just
much think taste great love good bag good one sweet will order much strong taste
dry really can much one make ginger tooth will can dark much taste try will
oatmeal one much try year fiber try like make much just great diamond roast much
will bad pasta bag old protein make much time good eat get blood drink add
oil say easy peanut get rice love small can add mix love keep blend spice
work will free butter baby g cup can gift soda love ship great make get
try give gluten potato will gram one great love use great purchase eat bean pop
#visualize top 5 words per topic
tidytext::tidy(reviews_lda, matrix = "beta") %>%
  dplyr::group_by(topic) %>%
  top_n(5, beta) %>% #count words with highest beta per topic
  dplyr::ungroup() %>%
  ggplot(aes(x = beta, y = reorder_within(term, beta, topic), fill = as.factor(topic))) +
  geom_col() +
  scale_fill_manual(values = MetBrew_Tam) +
  scale_color_manual(values = MetBrew_Tam) +
  facet_wrap(~ topic, scales = "free") +
  labs(title = "Top 5 words per LDA topic across Amazon fine food reviews subset",
       x = "Probability of word belonging to topic (beta)", y = "Word") +
  theme_classic() +
  theme(legend.position = "none")

Takeaway: This subset of reviews is primarily about specific products or broader consumer experiences such as finding items, pricing, and shipping.

Sensitivity analysis: Compare to LDA done without text cleaning
#redoing DTM as shown above without text cleaning prior to tokenization
data_dtm2 <- df %>%
  tidytext::unnest_tokens(word, text, token = "words") %>% 
  dplyr::filter(!is.na(word)) %>%
  dplyr::count(review_id, word) %>%
  dplyr::mutate(review_id = as.numeric(review_id)) %>%
  tidytext::cast_dtm(document = review_id, term = word, value = n)

#estimate an LDA topic model with 15 topics
reviews_lda2 <- topicmodels::LDA(data_dtm2, k = 15, control = list(seed = 2025)) #run topic model

#get top 15 terms per topic
knitr::kable(topicmodels::get_terms(reviews_lda2, 15))
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15
a to i the the i a the the the it the and to a
of a a to a and i and i i to and i i i
i and it a it the of of and to and this is it this
it of in i and a is to is it the in my is the
to i and and my is them a to is of it not this in
is it this of for but the for for and this of to for but
the this the in to it for are this a i you of of of
and for my are in so are is was my my with that the it
was not but for this have these have a you was not it with with
for that is that on they my in of but that a in in as
you with of with but these was i my in for i they that is
with you to as that are they on you so them flavor like and have
tea food that so with that and with that that is is you these so
but in like have like to so you like at but as this as for
in dog for not they this have good one like in was these they and

Takeaway: Without text cleaning, every topic is dominated by stopwords ("the", "and", "i"), making the topics uninterpretable. This is why stopword removal and related cleaning steps matter before topic modeling.
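The stopword dominance can be quantified directly: count what share of the uncleaned DTM's tokens are stopwords. A sketch, assuming `data_dtm2` from above (`slam` is a dependency of `tm`):

```r
#sketch: share of all tokens in the uncleaned DTM that are stopwords
term_totals <- slam::col_sums(data_dtm2) #total count per term across reviews
sum(term_totals[names(term_totals) %in% tidytext::stop_words$word]) / sum(term_totals)
```

A high share here means most of the probability mass the LDA has to explain is stopword usage, not content.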

Basic sentiment analysis

Using a compositional method (Valence Aware Dictionary and sEntiment Reasoner, i.e., VADER; Hutto & Gilbert, 2014)

#get VADER compound sentiment scores per review; use minimally-cleaned text because VADER is compositional (it uses punctuation, capitalization, and negation as cues)
df_vader <- df %>%
  dplyr::group_by(review_id) %>%
  dplyr::mutate(vader_output = vader::vader_df(text)) %>% #use vader package to assign sentiment values per review
  dplyr::mutate(sentiment_vader = vader_output$compound) %>% #create new column for compound vader sentiment score
  select(-c(vader_output)) %>% #remove extraneous vader output
  dplyr::ungroup() %>%
  as.data.frame()
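As a quick illustration of why VADER needs raw text, it reacts to intensifiers, punctuation, and negation (the sentences below are made-up examples, not reviews from the dataset):

```r
#sketch: VADER scores change with intensifiers, punctuation, and negation
vader::vader_df(c("This tea is good.",
                  "This tea is VERY good!!",
                  "This tea is not good."))$compound
```

Stripping punctuation or lowercasing before scoring would erase exactly the cues VADER relies on, which is why the minimally-cleaned `text` column is scored here rather than the fully cleaned tokens.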

#examine sentiment
knitr::kable(skim(df_vader))
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace factor.ordered factor.n_unique factor.top_counts numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character score 3 0.9997000 3 3 0 5 0 NA NA NA NA NA NA NA NA NA NA NA
character text 19 0.9981002 33 6608 0 9814 0 NA NA NA NA NA NA NA NA NA NA NA
character highly_reviewed 0 1.0000000 15 19 0 2 0 NA NA NA NA NA NA NA NA NA NA NA
factor review_id 0 1.0000000 NA NA NA NA NA FALSE 10000 311: 2, 4: 1, 16: 1, 45: 1 NA NA NA NA NA NA NA NA
factor productId 0 1.0000000 NA NA NA NA NA FALSE 3525 358: 82, 949: 80, 756: 76, 682: 72 NA NA NA NA NA NA NA NA
factor userId 0 1.0000000 NA NA NA NA NA FALSE 9280 484: 7, 495: 7, 527: 7, 165: 6 NA NA NA NA NA NA NA NA
numeric n_reviews 0 1.0000000 NA NA NA NA NA NA NA NA 14.5380462 19.0767863 1.000 2.000 5.000 22.000 82 ▇▂▁▁▁
numeric review_wordcount 19 0.9981002 NA NA NA NA NA NA NA NA 84.6415548 82.7039978 6.000 35.000 60.000 103.000 1318 ▇▁▁▁▁
numeric sentiment_vader 21 0.9979002 NA NA NA NA NA NA NA NA 0.6600209 0.4565488 -0.982 0.601 0.859 0.944 1 ▁▁▁▁▇
#histogram of sentiment
hist(df_vader$sentiment_vader)

#see what goes into compound sentiment breakdown
df_vader_breakdown <- df_vader %>%
  dplyr::group_by(review_id) %>%
  dplyr::mutate(vader_output = vader_df(text)) %>% #generate vader output
  dplyr::mutate(prop_positive = vader_output$pos, #create columns showing proportion of each review that are positive, negative, and neutral
         prop_neutral = vader_output$neu,
         prop_negative = vader_output$neg) %>%
  select(review_id, prop_positive, prop_neutral, prop_negative) %>%
  dplyr::ungroup() %>%
  pivot_longer(cols = c(prop_positive, prop_neutral, prop_negative), names_to = "sentiment_vader", values_to = "proportion") %>%
  dplyr::mutate(sentiment_vader = factor(sentiment_vader, levels = c("prop_negative","prop_neutral","prop_positive"), labels = c("negative", "neutral", "positive"))) %>% #relabeling negative, neutral, and positive proportions
  dplyr::filter(!(proportion < 0 | proportion > 1)) %>% #filter out miscalculated proportions
    drop_na(proportion)

#plot proportion of each review that is negative, positive, and neutral
df_vader_breakdown %>%
  ggplot(aes(x = as.factor(review_id), y = proportion, fill = as.factor(sentiment_vader))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("red3", "orange2", "gold")) +
  labs(title = "Proportions of negative, neutral, and positive VADER sentiment across reviews",
       subtitle = "These data underlie the compound VADER sentiment values.",
       x = "Review",
       y = "Proportion",
       fill = "Sentiment (VADER)") +
  ylim(0,1) +
  coord_flip() +
  theme_classic() +
  theme(legend.position = "bottom") +
  theme(axis.text.x=element_blank()) 

Takeaways: According to VADER, the language used in reviews is mostly positive or neutral.

#plot association between vader linguistic sentiment and review score
df_vader %>%
  drop_na(sentiment_vader, score) %>%
  ggplot(aes(y = as.numeric(sentiment_vader), x = as.factor(score), color = as.factor(score), fill = as.factor(score))) +
  geom_jitter(size = 1, color = "gray") +
  geom_violin(alpha = 0.7) +
  scale_fill_manual(values = MetBrew_Egypt) +
      scale_color_manual(values = MetBrew_Egypt) +
  geom_boxplot(width = 0.1, color = "white") +
  labs(x = "Review Score", y = "Review linguistic sentiment (VADER)") +
  theme_minimal() +
  theme(legend.position = "none") 

Takeaways: More positive linguistic sentiment via VADER is directionally associated with higher review scores.

#plot association between vader linguistic sentiment and review length
df_vader %>%
  subset(review_wordcount <= (84.64+(3*82.70))) %>% #subset to remove review length outliers more than 3 SD from average review length
  drop_na(sentiment_vader, review_wordcount) %>%
  ggplot(aes(y = as.numeric(sentiment_vader), x = as.numeric(review_wordcount))) +
  geom_jitter(size = 1, color = "gray") +
  geom_smooth(method = "loess") +
  labs(x = "Review Wordcount", y = "Review linguistic sentiment (VADER)") +
  theme_minimal() +
  theme(legend.position = "none") 
## `geom_smooth()` using formula = 'y ~ x'

Takeaways: Reviews have slightly more positive linguistic sentiment via VADER as they get longer, but are fairly positive in linguistic sentiment to begin with.

Method comparison: Using a sentiment lexicon (AFINN; Nielsen, 2011)

#select sentiment database - afinn
AFINN <- tidytext::get_sentiments("afinn")

#join sentiment values to review data by word matches
df_AFINN <- df_words %>%
  inner_join(AFINN, by = c("word")) %>%
  rename("sentiment_AFINN" = "value")
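AFINN assigns each word an integer valence from -5 (most negative) to +5 (most positive). A quick way to get a feel for the scale is to look up a few entries (the word choices below are arbitrary examples; words absent from the lexicon simply return no rows):

```r
#sketch: inspect a few AFINN entries to see the -5 to +5 valence scale
AFINN %>%
  dplyr::filter(word %in% c("bad", "terrible", "good", "love")) %>%
  dplyr::arrange(value)
```

Because the join above is per word, each review contributes one row per matched word; averaging `sentiment_AFINN` within a review (as done below) recovers a review-level score.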

#examine sentiment
knitr::kable(skim(df_AFINN))
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace factor.ordered factor.n_unique factor.top_counts numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character score 0 1 3 3 0 5 0 NA NA NA NA NA NA NA NA NA NA NA
character highly_reviewed 0 1 15 19 0 2 0 NA NA NA NA NA NA NA NA NA NA NA
character word 0 1 2 17 0 779 0 NA NA NA NA NA NA NA NA NA NA NA
factor review_id 0 1 NA NA NA NA NA FALSE 9777 363: 65, 539: 65, 484: 57, 360: 55 NA NA NA NA NA NA NA NA
factor productId 0 1 NA NA NA NA NA FALSE 3465 853: 502, 756: 458, 915: 449, 602: 424 NA NA NA NA NA NA NA NA
factor userId 0 1 NA NA NA NA NA FALSE 9070 590: 215, 412: 130, 135: 120, 527: 84 NA NA NA NA NA NA NA NA
numeric n_reviews 0 1 NA NA NA NA NA NA NA NA 15.666661 19.184105 1 2 6 23 82 ▇▂▁▁▁
numeric review_wordcount 0 1 NA NA NA NA NA NA NA NA 146.064745 148.515108 10 56 100 181 1318 ▇▁▁▁▁
numeric sentiment_AFINN 0 1 NA NA NA NA NA NA NA NA 1.481144 1.792796 -4 1 2 3 5 ▁▂▂▇▁
#histogram of sentiment
hist(df_AFINN$sentiment_AFINN)

#see distribution of average AFINN sentiment across reviews
df_AFINN %>%
  dplyr::group_by(review_id) %>%
  dplyr::mutate(average_sentiment_AFINN = mean(sentiment_AFINN),
         primary_sent_direction = ifelse(average_sentiment_AFINN > 0, "positive", "negative")) %>% #create columns for average sentiment per review and average direction positive or negative
  dplyr::ungroup() %>%
  select(c(review_id, average_sentiment_AFINN, primary_sent_direction)) %>%
  distinct() %>%
  ggplot(aes(x = average_sentiment_AFINN, y = reorder(as.factor(review_id), average_sentiment_AFINN), fill = as.factor(primary_sent_direction))) + #plot average sentiment scores, filled by average direction
  geom_col() +
  scale_fill_manual(values = c("red3", "green3")) +
  labs(title = "Average AFINN sentiment distribution across reviews",
       y = "Review",
       x = "Average AFINN sentiment",
       fill = "Sentiment (AFINN)") +
  theme_classic() +
  theme(axis.text.y=element_blank()) +
  theme(legend.position = "none")

Takeaways: Reviews are much more positive than negative in their linguistic sentiment according to AFINN.

#plot association between AFINN sentiment and review score
df_AFINN %>%
  dplyr::group_by(review_id) %>%
  dplyr::mutate(average_sentiment_AFINN = mean(sentiment_AFINN)) %>%
  dplyr::ungroup() %>%
  select(c(review_id, average_sentiment_AFINN, score)) %>%
  distinct() %>%
  ggplot(aes(y = as.numeric(average_sentiment_AFINN), x = as.factor(score), color = as.factor(score), fill = as.factor(score))) +
  geom_jitter(size = 1, color = "gray") +
  geom_violin(alpha = 0.7) +
  scale_fill_manual(values = MetBrew_Egypt) +
      scale_color_manual(values = MetBrew_Egypt) +
  geom_boxplot(width = 0.1, color = "white") +
  labs(x = "Review Score", y = "Average linguistic sentiment (AFINN)") +
  theme_minimal() +
  theme(legend.position = "none") 

Takeaways: More positive linguistic sentiment via AFINN is directionally associated with slightly better review scores.

#plot association between AFINN sentiment and review length
df_AFINN %>%
  subset(review_wordcount <= (84.64+(3*82.70))) %>% #remove reviews more than 3 SD above the mean review length (values taken from the earlier review-level summary)
  dplyr::group_by(review_id) %>%
  dplyr::mutate(average_sentiment_AFINN = mean(sentiment_AFINN)) %>%
  dplyr::ungroup() %>%
  drop_na(average_sentiment_AFINN, review_wordcount) %>%
  ggplot(aes(y = as.numeric(average_sentiment_AFINN), x = as.numeric(review_wordcount))) +
  geom_jitter(size = 1, color = "gray") +
  geom_smooth(method = "loess") +
  labs(x = "Review Wordcount", y = "Review linguistic sentiment (AFINN)") +
  theme_minimal() +
  theme(legend.position = "none") 
## `geom_smooth()` using formula = 'y ~ x'

Takeaways: Reviews have similar linguistic sentiment via AFINN regardless of length.
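Since both methods scored the same reviews, their agreement can be checked directly. A sketch, assuming `df_vader` and `df_AFINN` from above:

```r
#sketch: correlate review-level AFINN and VADER sentiment
df_compare <- df_AFINN %>%
  dplyr::group_by(review_id) %>%
  dplyr::summarise(average_sentiment_AFINN = mean(sentiment_AFINN)) %>%
  dplyr::inner_join(dplyr::select(df_vader, review_id, sentiment_vader),
                    by = "review_id")

cor(df_compare$average_sentiment_AFINN, df_compare$sentiment_vader,
    use = "complete.obs")
```

A moderate rather than near-perfect correlation would be expected: AFINN averages fixed word valences, while VADER also weighs negation, intensity, and punctuation.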

What did we learn?

In this workshop, we:

  • Set up a text analysis plan in R

  • Learned about how and why to use different text cleaning steps

  • Examined common experiences and pain points discussed in a random subset of Amazon fine foods reviews (McAuley & Leskovec, 2013) using basic text analyses and visualizations, including:

    • Evaluation of most frequent words, bigrams, and associations between words in all reviews and those with high vs. low stars

    • Summarization of themes using topic modeling (LDA)

    • Examination of linguistic sentiment via sentiment analysis

      • Comparison of multiple methods to highlight the importance of selecting use-case-appropriate methods
  • Developed meaningful business insights from a subset of unstructured review text data

Remember to carefully evaluate data sources and quality, clean text data in line with best practices, choose analyses best suited to your use case, and interpret results in context, keeping in mind that they are often observational.

Happy text analyzing!

References & Additional Resources

References

McAuley, J. & Leskovec, J. (2013). From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.

Porter, M. F. (2001). Snowball: A language for stemming algorithms. https://snowballstem.org

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal Of Machine Learning Research, 3(4/5), 993-1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

Hutto, C., & Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of the
International AAAI Conference on Web and Social Media, 8(1), 216-225. https://doi.org/10.1609/icwsm.v8i1.14550

Nielsen, F. Å. (2011) A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages 718 in CEUR Workshop Proceedings 93-98. 2011 May. http://arxiv.org/abs/1103.2903

R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham H (2023). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1. https://doi.org/10.32614/CRAN.package.stringr, https://CRAN.R-project.org/package=stringr

Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5. https://doi.org/10.32614/CRAN.package.skimr, https://CRAN.R-project.org/package=skimr

Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” Journal of Open Source Software, 1(3). https://doi.org/10.21105/joss.00037

Rinker, T. W. (2018). textclean: Text Cleaning Tools version 0.9.3. Buffalo, New York. https://github.com/trinker/textclean

Fellows I (2018). wordcloud: Word Clouds. R package version 2.6. https://doi.org/10.32614/CRAN.package.wordcloud, https://CRAN.R-project.org/package=wordcloud

Feinerer I, Hornik K (2025). tm: Text Mining Package. R package version 0.7-16. https://doi.org/10.32614/CRAN.package.tm, https://CRAN.R-project.org/package=tm

Roehrick K (2020). vader: Valence Aware Dictionary and sEntiment Reasoner (VADER). R package version 0.2.1. https://doi.org/10.32614/CRAN.package.vader, https://CRAN.R-project.org/package=vader

Grün B, Hornik K (2011). “topicmodels: An R Package for Fitting Topic Models.” Journal of Statistical Software, 40(13), 1–30. https://doi.org/10.18637/jss.v040.i13

Mills BR (2022). MetBrewer: Color Palettes Inspired by Works at the Metropolitan Museum of Art. R package version 0.2.0. https://doi.org/10.32614/CRAN.package.MetBrewer, https://CRAN.R-project.org/package=MetBrewer

Posit team (2025). RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. URL http://www.posit.co/.

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.5.0 (2025-04-11)
##  os       macOS Sequoia 15.6
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/New_York
##  date     2025-12-12
##  pandoc   3.4 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
##  quarto   1.6.42 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/quarto
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package        * version date (UTC) lib source
##  base64enc        0.1-3   2015-07-28 [1] CRAN (R 4.5.0)
##  bslib            0.9.0   2025-01-30 [1] CRAN (R 4.5.0)
##  cachem           1.1.0   2024-05-16 [1] CRAN (R 4.5.0)
##  cli              3.6.5   2025-04-23 [1] CRAN (R 4.5.0)
##  data.table       1.17.2  2025-05-12 [1] CRAN (R 4.5.0)
##  dichromat        2.0-0.1 2022-05-02 [1] CRAN (R 4.5.0)
##  digest           0.6.37  2024-08-19 [1] CRAN (R 4.5.0)
##  dplyr          * 1.1.4   2023-11-17 [1] CRAN (R 4.5.0)
##  evaluate         1.0.3   2025-01-10 [1] CRAN (R 4.5.0)
##  farver           2.1.2   2024-05-13 [1] CRAN (R 4.5.0)
##  fastmap          1.2.0   2024-05-15 [1] CRAN (R 4.5.0)
##  forcats        * 1.0.0   2023-01-29 [1] CRAN (R 4.5.0)
##  fs               1.6.6   2025-04-12 [1] CRAN (R 4.5.0)
##  generics         0.1.4   2025-05-09 [1] CRAN (R 4.5.0)
##  ggplot2        * 4.0.0   2025-09-11 [1] CRAN (R 4.5.0)
##  glue             1.8.0   2024-09-30 [1] CRAN (R 4.5.0)
##  gtable           0.3.6   2024-10-25 [1] CRAN (R 4.5.0)
##  hms              1.1.3   2023-03-21 [1] CRAN (R 4.5.0)
##  htmltools        0.5.8.1 2024-04-04 [1] CRAN (R 4.5.0)
##  janeaustenr      1.0.0   2022-08-26 [1] CRAN (R 4.5.0)
##  jquerylib        0.1.4   2021-04-26 [1] CRAN (R 4.5.0)
##  jsonlite         2.0.0   2025-03-27 [1] CRAN (R 4.5.0)
##  knitr            1.50    2025-03-16 [1] CRAN (R 4.5.0)
##  koRpus           0.13-8  2021-05-17 [1] CRAN (R 4.5.0)
##  koRpus.lang.en   0.1-4   2020-10-24 [1] CRAN (R 4.5.0)
##  labeling         0.4.3   2023-08-29 [1] CRAN (R 4.5.0)
##  lattice          0.22-6  2024-03-20 [1] CRAN (R 4.5.0)
##  lexicon          1.2.1   2019-03-21 [1] CRAN (R 4.5.0)
##  lifecycle        1.0.4   2023-11-07 [1] CRAN (R 4.5.0)
##  lubridate      * 1.9.4   2024-12-08 [1] CRAN (R 4.5.0)
##  magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.5.0)
##  Matrix           1.7-3   2025-03-11 [1] CRAN (R 4.5.0)
##  MetBrewer      * 0.2.0   2022-03-21 [1] CRAN (R 4.5.0)
##  mgcv             1.9-1   2023-12-21 [1] CRAN (R 4.5.0)
##  mnormt           2.1.1   2022-09-26 [1] CRAN (R 4.5.0)
##  modeltools       0.2-24  2025-05-02 [1] CRAN (R 4.5.0)
##  nlme             3.1-168 2025-03-31 [1] CRAN (R 4.5.0)
##  NLP            * 0.3-2   2024-11-20 [1] CRAN (R 4.5.0)
##  pillar           1.11.0  2025-07-04 [1] CRAN (R 4.5.0)
##  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.5.0)
##  plyr             1.8.9   2023-10-02 [1] CRAN (R 4.5.0)
##  psych          * 2.5.3   2025-03-21 [1] CRAN (R 4.5.0)
##  purrr          * 1.0.4   2025-02-05 [1] CRAN (R 4.5.0)
##  qdapRegex        0.7.10  2025-03-24 [1] CRAN (R 4.5.0)
##  R6               2.6.1   2025-02-15 [1] CRAN (R 4.5.0)
##  rappdirs         0.3.3   2021-01-31 [1] CRAN (R 4.5.0)
##  RColorBrewer   * 1.1-3   2022-04-03 [1] CRAN (R 4.5.0)
##  Rcpp             1.0.14  2025-01-12 [1] CRAN (R 4.5.0)
##  readr          * 2.1.5   2024-01-10 [1] CRAN (R 4.5.0)
##  repr             1.1.7   2024-03-22 [1] CRAN (R 4.5.0)
##  reshape2         1.4.4   2020-04-09 [1] CRAN (R 4.5.0)
##  rlang            1.1.6   2025-04-11 [1] CRAN (R 4.5.0)
##  rmarkdown        2.29    2024-11-04 [1] CRAN (R 4.5.0)
##  rstudioapi       0.17.1  2024-10-22 [1] CRAN (R 4.5.0)
##  S7               0.2.0   2024-11-07 [1] CRAN (R 4.5.0)
##  sass             0.4.10  2025-04-11 [1] CRAN (R 4.5.0)
##  scales           1.4.0   2025-04-24 [1] CRAN (R 4.5.0)
##  sessioninfo      1.2.3   2025-02-05 [1] CRAN (R 4.5.0)
##  skimr          * 2.1.5   2022-12-23 [1] CRAN (R 4.5.0)
##  slam             0.1-55  2024-11-13 [1] CRAN (R 4.5.0)
##  SnowballC        0.7.1   2023-04-25 [1] CRAN (R 4.5.0)
##  stringi          1.8.7   2025-03-27 [1] CRAN (R 4.5.0)
##  stringr        * 1.5.1   2023-11-14 [1] CRAN (R 4.5.0)
##  sylly            0.1-6   2020-09-20 [1] CRAN (R 4.5.0)
##  sylly.en         0.1-3   2018-03-19 [1] CRAN (R 4.5.0)
##  syuzhet          1.0.7   2023-08-11 [1] CRAN (R 4.5.0)
##  textclean      * 0.9.3   2018-07-23 [1] CRAN (R 4.5.0)
##  textdata         0.4.5   2024-05-28 [1] CRAN (R 4.5.0)
##  textshape        1.7.5   2024-04-01 [1] CRAN (R 4.5.0)
##  textstem         0.1.4   2018-04-09 [1] CRAN (R 4.5.0)
##  tibble         * 3.3.0   2025-06-08 [1] CRAN (R 4.5.0)
##  tidyr          * 1.3.1   2024-01-24 [1] CRAN (R 4.5.0)
##  tidyselect       1.2.1   2024-03-11 [1] CRAN (R 4.5.0)
##  tidytext       * 0.4.2   2024-04-10 [1] CRAN (R 4.5.0)
##  tidyverse      * 2.0.0   2023-02-22 [1] CRAN (R 4.5.0)
##  timechange       0.3.0   2024-01-18 [1] CRAN (R 4.5.0)
##  tm             * 0.7-16  2025-02-19 [1] CRAN (R 4.5.0)
##  tokenizers       0.3.0   2022-12-22 [1] CRAN (R 4.5.0)
##  topicmodels    * 0.2-17  2024-08-14 [1] CRAN (R 4.5.0)
##  tzdb             0.5.0   2025-03-15 [1] CRAN (R 4.5.0)
##  vader          * 0.2.1   2020-09-07 [1] CRAN (R 4.5.0)
##  vctrs            0.6.5   2023-12-01 [1] CRAN (R 4.5.0)
##  withr            3.0.2   2024-10-28 [1] CRAN (R 4.5.0)
##  wordcloud      * 2.6     2018-08-24 [1] CRAN (R 4.5.0)
##  xfun             0.52    2025-04-02 [1] CRAN (R 4.5.0)
##  xml2             1.3.8   2025-03-14 [1] CRAN (R 4.5.0)
##  yaml             2.3.10  2024-07-26 [1] CRAN (R 4.5.0)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
##  * ── Packages attached to the search path.
## 
## ──────────────────────────────────────────────────────────────────────────────